An Analytic Perspective on Income, Race and Drug Use#
Kiefer Plender, Remolo van de Plassen, Ouail Moukthari, Huub Al
1.1 Introduction#
Drug abuse is a complex issue that affects large parts of modern society. Stepping away from bias and stereotypes, our data story aims to give a clear overview of drug abuse by presenting two distinct perspectives on the topic.
Our first perspective investigates whether individuals who have a lower income and belong to a racial minority group are more likely to abuse illicit drugs. It follows the narrative that these people face more challenges in day-to-day life, such as financial problems or fewer job opportunities. Given the nature of drugs (specifically downers), we think these people might pick up drug habits to cope with these problems sooner than more well-off individuals. The second perspective takes a broader view of the topic. It states that drug use is a universal problem and that factors like race or income do not play a direct role. Individuals with lower incomes may be more vulnerable to drug abuse, but low income isn’t the only factor that contributes to this statistic. Under this perspective, our data study relies on the notion that we can attribute the issue to more general factors, like peer pressure or general sensitivity to addiction.
By reviewing these two perspectives, we aim to present a more nuanced view of drug abuse and its victims. Challenging the current stereotypes and stigmas associated with drug abuse can help create a society that is educated and supports the people affected by this issue (Livingston, Milne, Fang, & Amari, 2012).
1.2 Dataset and preprocessing#
To provide a clear overview, we decided to use a large dataset from the 2015 National Survey on Drug Use and Health. The survey gives a representative view of the adult population of the USA. Because the dataset is largely complete and contains a large number of variables, the data story is based solely on this dataset, supported by academic papers where needed.
Fortunately, the dataset was clean and didn’t require much preprocessing to be usable. However, because it is survey data, the answers were stored as coded values (often binary) that needed to be translated to their corresponding real-world values. We used the legend (codebook) of the dataset to provide a more intuitive interpretation. For example, we converted variables like sex, which has a value of 1 or 2, to the corresponding nominal values ‘Male’ and ‘Female’. Other than this translation step, little preprocessing was needed to create the figures.
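The translation step described above can be sketched as follows. This is a minimal illustration on made-up rows, not the preprocessing code used for the data story: the mapping 1 = Male, 2 = Female follows the plotting code further down, while the column name `sex_str` is a hypothetical name chosen to mirror the existing `race_str` column.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the survey file.
sample = pd.DataFrame({"sex": [1, 2, 2, 1]})

# Code-to-label mapping taken from the legend (1 = Male, 2 = Female).
SEX_LABELS = {1: "Male", 2: "Female"}

# 'sex_str' is an assumed column name, modelled on 'race_str'.
sample["sex_str"] = sample["sex"].map(SEX_LABELS)

print(sample)
```

Codes that are missing from the mapping become NaN after `map`, which makes unexpected values easy to spot.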
1.3 Visualisations#
Import of packages and reading our dataset#
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
df = pd.read_csv('nsduh_workforce_adults.csv')
First visualisation ( Bar Plot: Drug usage by race and sex):#
This bar chart shows the average drug usage rate grouped by race and sex. The x-axis shows the race groups and the y-axis the drug usage in %. Each race group is further split by sex, which in this dataset is either Male or Female. We split by sex because gender may itself contribute to minority status or prejudice. It’s clear that some races generally have higher drug usage, but this is not the main takeaway of this plot. The main interest is the proportion of male to female drug users within each group. For the Asian and Mixed groups there is little difference between the sexes, but for the Black/African American group there is a large difference in %. Using articles, we plan to attribute these differences to a combination of culture and/or sex. This will help shape a more inclusive view of drug abuse and help our other findings take a more concrete shape.
df = pd.read_csv('nsduh_workforce_adults.csv')
df_grouped = df.groupby(['race_str', 'sex'])['anydrugever'].mean().reset_index()
df_grouped.sort_values('race_str', inplace=True)
races = df_grouped['race_str'].unique()
male_df = df_grouped[df_grouped['sex'] == 1]
female_df = df_grouped[df_grouped['sex'] == 2]
trace1 = go.Bar(x=races, y=male_df['anydrugever'].values * 100, name='Male')
trace2 = go.Bar(x=races, y=female_df['anydrugever'].values * 100, name='Female')
layout = go.Layout(
    title='Drug Usage by Race and Sex',
    xaxis=dict(title='Race'),
    yaxis=dict(title='Drug Usage (%)', dtick=10),
    barmode='group'
)
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.add_annotation(
    xref="paper",
    yref="paper",
    x=0.5,
    y=-0.2,
    xanchor="center",
    text="Helped by the GPT-4 prompt: Help me to create a bar plot to show the Drug Usage by Race and Sex (in %) for race and drug use with Plotly, use arbirtary column names. 17-6-23",
    showarrow=False,
    font=dict(size=10)
)
fig.show()
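A per-race, per-sex rate table can also be built in one step with a pivot table, which keeps each rate attached to its race label even when a race/sex combination is absent. The sketch below is illustrative only: it uses made-up rows with the same column names as the code above and is not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data using the column names from the code above.
toy = pd.DataFrame({
    "race_str": ["Asian", "Asian", "Black", "Black", "Black", "White"],
    "sex": [1, 2, 1, 1, 2, 1],
    "anydrugever": [0, 1, 1, 0, 1, 1],
})

# One row per race, one column per sex; cells hold the mean of the 0/1 flag in %.
rates = (
    toy.pivot_table(index="race_str", columns="sex",
                    values="anydrugever", aggfunc="mean")
    .rename(columns={1: "Male", 2: "Female"})
    * 100
)

print(rates)
```

In this toy data there are no female rows for "White", so that cell is NaN instead of shifting the remaining values onto the wrong race.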
Second visualisation ( Heat map: Percentage of Drug Use (Ever) by Race ):#
This plot shows the percentage of people of each ethnicity who have ever used a certain type of drug. The y-axis shows the ethnicities and the x-axis the types of drugs. The plot shows that marijuana is by far the drug that most people have tried, and that crack and heroin are the drugs that the fewest people have used. Native Americans have the highest rates of all races for several types of drugs: cocaine, crack, hallucinogens, inhalants, meth, and tranquilizers. According to a medically reviewed article by American Addiction Centers, this is a well-known problem among Native Americans. It could potentially be explained by historical trauma, violence (including high levels of gang violence, domestic violence, and sexual assault), poverty, high levels of unemployment, discrimination, racism, lack of health insurance, or low levels of attained education (Substance Abuse Statistics for Native Americans, 2022). Another finding is that Asian people have tried far fewer drugs than other races.
df = pd.read_csv('nsduh_workforce_adults.csv')
variables = ['marij_ever', 'cocaine_ever', 'crack_ever', 'heroin_ever', 'hallucinogen_ever',
             'inhalant_ever', 'meth_ever', 'painrelieve_ever', 'tranq_ever', 'stimulant_ever']
# Human-readable labels for the heat map columns
full_names = {
    'marij_ever': 'Marijuana',
    'cocaine_ever': 'Cocaine',
    'crack_ever': 'Crack',
    'heroin_ever': 'Heroin',
    'hallucinogen_ever': 'Hallucinogen',
    'inhalant_ever': 'Inhalant',
    'meth_ever': 'Methamphetamine',
    'painrelieve_ever': 'Pain Reliever',
    'tranq_ever': 'Tranquilizer',
    'stimulant_ever': 'Stimulant'
}
total_counts = df['race_str'].value_counts()
counts = df.groupby('race_str')[variables].sum()
counts = counts.rename(columns=full_names)
proportions = counts.div(total_counts, axis=0) * 100
proportions = proportions.round(2)
fig = px.imshow(proportions, labels=dict(x="Type of drug", y="Race", color="Percentage"),
title="Percentage of Drug Use (Ever) by Race", color_continuous_scale='YlOrRd',
zmin=0, zmax=100)
# Overlay each cell's percentage as a text annotation
annotations = []
for i in range(len(proportions)):
    for j in range(len(proportions.columns)):
        annotations.append(dict(
            x=j,
            y=i,
            text=str(proportions.iloc[i, j]) + '%',
            showarrow=False,
            font=dict(color='black', size=8)
        ))
fig.update_layout(annotations=annotations)
fig.update_xaxes(side="top")
fig.add_annotation(
    xref="paper",
    yref="paper",
    x=0.5,
    y=-0.2,
    xanchor="center",
    text="Helped by the GPT-4 prompt: Help me to create a heatmap plot to show the proportions for race and drug use with Plotly. 18-6-23",
    showarrow=False,
    font=dict(size=10)
)
fig.show()
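The percentages in the heat map are per-race sums of 0/1 flags divided by the number of respondents of that race. For flags without missing values this is the same as taking the group mean, which the sketch below checks on made-up rows. The data is hypothetical; only the column names are taken from the code above.

```python
import pandas as pd

# Hypothetical toy data with two of the 0/1 "ever used" columns.
toy = pd.DataFrame({
    "race_str": ["A", "A", "A", "B", "B"],
    "marij_ever": [1, 0, 1, 0, 1],
    "heroin_ever": [0, 0, 1, 0, 0],
})
cols = ["marij_ever", "heroin_ever"]

# Approach used above: users per race divided by respondents per race.
by_division = (
    toy.groupby("race_str")[cols].sum()
    .div(toy["race_str"].value_counts(), axis=0) * 100
)

# Equivalent shortcut when the flags contain no missing values.
by_mean = toy.groupby("race_str")[cols].mean() * 100

print(by_division.round(2))
print(by_mean.round(2))
```

If the flag columns did contain missing values, the two would differ: the division counts every respondent in the denominator, while the mean skips the missing answers.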
Third visualisation ( Correlation Plot: Income, Education, and Drugs):#
Beforehand, we expected that people with lower incomes would be more likely to use drugs because of their economic and social circumstances. However, the correlation plot based on our data suggests otherwise. At first, we only looked at the correlation between ‘countofdrugs_ever’ and personal income, family income, and education, and we soon found that there was no clear correlation. We thought this might be due to the data, so we added ‘countofdrugs_month’ and ‘countofdrugs_year’ to check whether this initial finding holds. As the correlation plot shows, there is no clear correlation between drug use and income or education.
df = pd.read_csv('nsduh_workforce_adults.csv')
columns = ['PersonalIncome', 'FamilyIncome', 'education', 'countofdrugs_ever', 'countofdrugs_month', 'countofdrugs_year']
selected_data = df[columns]
correlation_matrix = selected_data.corr()
# Show only the drug-count rows of the correlation matrix
fig = px.imshow(correlation_matrix.loc[['countofdrugs_ever', 'countofdrugs_month', 'countofdrugs_year'], :],
                labels=dict(color="Correlation"), color_continuous_scale='YlOrRd')
fig.update_layout(
    title='Correlation Plot: Income, Education, and Drugs',
    annotations=[
        dict(
            x=0.5,
            y=-0.35,
            xref='paper',
            yref='paper',
            text="Helped by the ChatGPT prompt: Maak een willekeurige correlatie plot gebaseerd op 4 verschillende data die ik zelf moet invoeren. 20-6-23",
            showarrow=False,
            font=dict(size=10)
        )
    ],
    margin=dict(l=50, r=50, t=50, b=120)
)
fig.show()
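`DataFrame.corr()` computes Pearson correlation by default. If the income and education columns are stored as ordered category codes (an assumption about this dataset, not something verified here), a rank-based correlation may be a useful cross-check. The sketch below uses made-up values and computes a Spearman-style correlation as the Pearson correlation of the ranks; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the code above.
toy = pd.DataFrame({
    "PersonalIncome": [1, 2, 3, 4, 5, 6],
    "education": [1, 1, 2, 2, 3, 3],
    "countofdrugs_ever": [0, 3, 1, 0, 2, 1],
})

# Pearson correlation, as used for the plot above.
pearson = toy.corr()

# Rank-based (Spearman-style) correlation: Pearson correlation of the ranks.
spearman = toy.rank().corr()

# Keep only the drug-count row against the socio-economic columns.
drug_row = spearman.loc[["countofdrugs_ever"], ["PersonalIncome", "education"]]
print(drug_row.round(2))
```

If the two matrices tell the same story, the "no clear correlation" reading does not hinge on treating the category codes as evenly spaced numbers.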
Fourth visualisation ( Parallel categories Plot: Income, Education, and Drugs):#
We were hoping for a clearer visualisation of the connection between income and drug use via a parallel categories plot. The idea was to visualise the most common combinations of socio-economic factors, such as education and income, that are associated with higher drug use. The first plot shows the combinations of factors for all people. This graph is not very informative, since most people have used few or no drugs. The bins in the graph are the following: Low = 0-3 different drugs ever used, Medium = 4-6 drugs, and High = 7+. The second graph shows only the Medium and High groups, which gives the desired visualisation. But just like in the previous section, the combinations of variables associated with a high variety in drug usage seem to be very random and not related to income or education at all.
df = pd.read_csv('nsduh_workforce_adults.csv')
# Column names
columns = ['race_str', 'PersonalIncome', 'education', 'countofdrugs_ever', 'FamilyIncome']
# Create DataFrame
df = pd.DataFrame(df, columns=columns)
# Using pd.cut (three equal-width bins); the 'qcut' in the names below is only a label
df['amount_drugs_qcut'], qcut_bins = pd.cut(df['countofdrugs_ever'], bins=3, labels=['Low', 'Medium','High'], retbins=True)
print("Bins for qcut:", qcut_bins)
# filter rows with only high and medium drug use.
df_filtered = df[df['amount_drugs_qcut'].isin(['Medium', 'High'])]
# Create Parallel Categories plot
parcatsall = go.Figure(
    data=[go.Parcats(
        dimensions=[
            {'label': 'Personal Income', 'values': df['PersonalIncome'], 'categoryorder': 'category ascending'},
            {'label': 'Education', 'values': df['education'], 'categoryorder': 'category ascending'},
            {'label': 'Family Income', 'values': df['FamilyIncome'], 'categoryorder': 'category ascending'},
            {'label': 'Drug Use', 'values': df['amount_drugs_qcut']},
        ],
        line={'color': df['amount_drugs_qcut'].map({'Low': 'lightblue','Medium': 'lightgreen', 'High': 'orangered'})},
        labelfont={'size': 12},
        tickfont={'size': 12},
        arrangement='freeform'
    )],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})
parcatsall.add_annotation(
    x=0.5,
    y=-0.1,
    xref='paper',
    yref='paper',
    text='Helped by the ChatGPT prompt: Create a sample of a parallel categories graph with 4 variables. 19-6-23',
    showarrow=False,
    font=dict(size=10)
)
# Show plot
parcatsall.show()
# Create Parallel Categories plot
parcats = go.Figure(
    data=[go.Parcats(
        dimensions=[
            {'label': 'Personal Income', 'values': df_filtered['PersonalIncome'], 'categoryorder': 'category ascending'},
            {'label': 'Education', 'values': df_filtered['education'], 'categoryorder': 'category ascending'},
            {'label': 'Family Income', 'values': df_filtered['FamilyIncome'], 'categoryorder': 'category ascending'},
            {'label': 'Drug Use', 'values': df_filtered['amount_drugs_qcut']},
        ],
        line={'color': df_filtered['amount_drugs_qcut'].map({'Medium': 'lightgreen', 'High': 'orangered'})},
        labelfont={'size': 12},
        tickfont={'size': 12},
        arrangement='freeform'
    )],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})
parcats.add_annotation(
    x=0.5,
    y=-0.1,
    xref='paper',
    yref='paper',
    text='Helped by the ChatGPT prompt: Create a sample of a parallel categories graph with 4 variables. 19-6-23',
    showarrow=False,
    font=dict(size=10)
)
# Show plot
parcats.show()
Bins for qcut: [-0.01 3.33333333 6.66666667 10. ]
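The printed edges come from `pd.cut`, which splits the value range into equal-width intervals, so the Low group can end up much larger than the others when most people have used few drugs. `pd.qcut` instead picks edges so that the groups hold roughly equal numbers of people. The sketch below contrasts the two on made-up counts. The data is hypothetical and only illustrates the difference; it does not reproduce the survey distribution.

```python
import pandas as pd

# Hypothetical drug counts, skewed towards low values.
counts = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 5, 10])
labels = ["Low", "Medium", "High"]

# Equal-width bins over the range 0-10 (the approach used above).
equal_width, width_edges = pd.cut(counts, bins=3, labels=labels, retbins=True)

# Equal-frequency bins based on the tertiles of the data.
equal_freq, freq_edges = pd.qcut(counts, q=3, labels=labels, retbins=True)

print("cut edges: ", width_edges)
print(equal_width.value_counts().sort_index())
print("qcut edges:", freq_edges)
print(equal_freq.value_counts().sort_index())
```

On heavily skewed real data, `pd.qcut` can fail because several quantile edges coincide; its `duplicates='drop'` option handles that case at the cost of fewer bins.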